A comparison of human and GPT-4 use of probabilistic phrases in a coordination game

English speakers use probabilistic phrases such as likely to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey and, if communication is successful, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts, investment advice and medical advice. We then had GPT-4 (OpenAI), a Large Language Model, complete the same tasks as the human participants. We found that GPT-4's estimates of probability in both the Investment and Medical Contexts were as close to, or closer to, those of the human participants as the human participants' estimates were to one another. However, further analyses of residuals disclosed small but significant differences between human and GPT-4 performance. Human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.

Lastly, (5) the tasks we consider, communicating information about probability and ambiguity, are intrinsically important. There is a large literature concerning human error in decision making 14, failures in probabilistic reasoning 15, and the consequences of these errors and failures 16,17.
The capabilities of GPT-4 and other candidate Artificial General Intelligences have been compared to those of humans, with mixed results. GPT-4 fails simple intelligence tests 18. On the other hand, Webb, Holyoak, and Lu 19 report that GPT-4's ability to engage in analogical reasoning and abstract pattern induction is comparable to that of humans. Gurnee & Tegmark 20 find that it can reason about spatial and temporal structure. GPT-4 can do more than chat: it can write simple computer code for applications specified in natural language 21.
The probabilistic phrases we consider (Table 1) have been used in previous research with human participants 1. We modify any probabilistic phrase as needed so that its use in context is grammatical (e.g. possible can become possibility). In Fig. 1 we illustrate one turn of the coordination game as a communication channel 22,23. For simplicity we assume that, in the coordination game, the first player has only one probability and one ambiguity to signal and the second player is constrained to report a single estimate of probability and one of ambiguity.
In the full coordination game (Fig. 1), Player 1 is given a target probability (for example, 63%) and must encode it as one of the probabilistic phrases in Table 1. Perhaps she picks likely. This probabilistic phrase is transmitted to Player 2, who must decode it and estimate the target probability. In Fig. 1 she estimates 70%. The absolute difference (7%) between Player 1's target probability, 63%, and Player 2's estimate, 70%, is the error, a measure of failure of coordination in the coordination game.
We will focus on the second stage (DECODE) of the coordination game (outlined in red in Fig. 1), evaluating GPT-4's performance as Player 2 and comparing GPT-4's performance to that of human participants also playing as Player 2. That is, GPT-4 and its human counterpart will be asked to DECODE probability phrases and estimate corresponding probabilities. We will also ask GPT-4 and human players to rate the ambiguity (imprecision) of the 23 probabilistic phrases they decode on a scale of 0-100. All ambiguity estimates were done after all probability estimates by both human participants and GPT-4. Probabilistic phrases were presented in randomized order to both human participants and GPT-4.
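The error on a single turn is simple arithmetic, and a short sketch may make the definition concrete. The function names below are illustrative assumptions rather than anything defined in the text, and the averaging helper is a hypothetical summary added only for illustration.

```python
def coordination_error(target_percent, estimate_percent):
    """Absolute difference between Player 1's target and Player 2's estimate."""
    return abs(target_percent - estimate_percent)


def mean_coordination_error(turns):
    """Average error over (target, estimate) pairs.

    Hypothetical summary for illustration; the text defines only the
    per-turn error.
    """
    if not turns:
        raise ValueError("need at least one turn")
    return sum(coordination_error(t, e) for t, e in turns) / len(turns)


# The example turn from the text: target 63%, estimate 70%.
print(coordination_error(63, 70))  # -> 7
```

An error of 0 on a turn means perfect coordination on that turn; larger values mean a larger failure to coordinate.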
We emphasize that, from the viewpoint of Player 2 (the human participant or GPT-4), the game is the coordination game illustrated in Fig. 1, in which Player 2 believes he is trying to coordinate a choice of probability with Player 1. In actuality the role of Player 1 is played by the experimenter, who provides Player 2 with probabilistic phrases in a predetermined randomized order, but Player 2 does not know that. We collect and analyze data from only Player 2, the part of Fig. 1 marked with a dashed red contour. By having the researcher play the role of Player 1, we ensure comparability of input between human and GPT-4: Player 2 saw probabilistic phrases in an order determined by the experimenter alone, not the experimenter and an uncontrolled Player 1. The essence of a 2-person coordination game 2 is that the two players must each anticipate what the other player is thinking, and the anticipation is recursive: "I know that you know that I know …". Our game captures that essence. Player 2, human or GPT-4, is asked to see himself through the eyes of a doctor or financial advisor who is trying to communicate information about uncertainty to him. Can an LLM do this as well as a human?
Figure 1. The Coordination Game. On each turn in the coordination game, the First Player is given a probability p (unknown to the second player) and asked to encode it as a probabilistic phrase. Table 1 lists the probabilistic phrases that the First Player could use to encode the probability. The Second Player (either a human participant or GPT) is then given only the selected phrase and asked to estimate the original probability. The game is a model of transmission through a communication channel. We focused on only the Decoding phase of the game (enclosed by a dashed red contour). Either GPT-4 or a human participant played the role of Player 2 while the experimenter played the role of Player 1. In a variant of the game we asked GPT-4 or the human participant to estimate not the probability but instead the ambiguity of the probability phrase.
There are previous studies whose participants were asked to assign explicit probabilities to probabilistic phrases 1,24-31. See 32 or 33 for review. These studies assess the extent to which humans, for the most part without any special training, agree with each other in their use of probabilistic phrases to signal probability. If all the speakers in a language community assign the same probabilities to probabilistic phrases, then the players would do very well at the coordination game.

Questions
1. Can GPT-4 play the coordination game as well as humans? Are there patterned deviations between human and machine? We needed a criterion to judge whether GPT-4, playing as Player 2, is doing what a human player would. We developed two criteria, the first based on linear model fits, the second on performance.
The first criterion is based on fitting a linear model to bivariate scatterplot data, as we explain below. The fitted intercept and slope give us clues about shortcomings of GPT-4 even when GPT-4 overall plays the game well as measured by the second criterion. We do not claim that the linear model is an adequate model of the mapping from GPT-4's estimates of probability or ambiguity to the corresponding human estimates. The fitted parameters serve as summary statistics intended to aid in interpreting the data.
Second, we evaluate how well human and GPT-4 coordinate. We develop a measure, discordance, of the extent to which human players disagree with one another in playing the game and compare this measure to the discordance between GPT-4 and human players. Does GPT-4 perform as well in the coordination game as the median human player?
There is more to language competence than assigning probability estimates and ambiguity ratings, but systematic failure to coordinate with human participants in our game would weaken any claim that GPT-4's abilities are human-like. We could not trust an Artificial General Intelligence to give medical advice if the probability phrases it uses were not correctly understood by human patients.

2. Is GPT-4 correctly sensitive to context?
The meanings of words can depend on the context in which they occur. If your doctor and your financial consultant both use the phrase not certain, does it signal the same probability? The same ambiguity? We will include two contexts in the experiment, medical and financial, and ask participants, including GPT-4, to rate the 23 probabilistic phrases in Table 1 for probability and for ambiguity in each context.
Humans distort small and large probabilities 33. A doctor may well avoid probabilistic phrases near the extremes of the probability scale (e.g. "almost certain") precisely because he knows his patients will distort them. Or he might bias his choice of words to counteract the expected bias of his patients. Similarly, if the Player suspects the motives of a financial adviser he might try to "debias" his estimates.
Keep in mind that the issue is not whether GPT-4's assessment of probability and ambiguity is invariant under context but whether GPT-4 exhibits the same changes or lack of change in probability and ambiguity ratings across context as do the human participants. In a coordination game it doesn't matter whether you are right, only whether you agree with everyone else.
3. Is GPT-4 stable? Lastly, we briefly investigate the stability of GPT-4 in this game. If we rerun the estimates by GPT-4, do we get a series of similar estimates, or a series of similar estimates with the occasional highly discrepant estimate? We might hesitate to permit an Artificial General Intelligence to give medical advice if 1 time out of 100 it produced markedly discrepant estimates of probability or ambiguity. The motivation for testing stability will become clear when we examine the data.

Results
The results are presented in three numbered sections corresponding to the numbered questions above. We split the first question in two, one part (1A) concerned with probability, the other (1B) with ambiguity.

Human vs. GPT-4: probability
In Fig. 2 we plot the median probability ratings assigned to each of the 23 probabilistic phrases by the 25 human participants against the GPT-4 ratings of each of the 23 probabilistic phrases. Figure 2a shows results in the Investment Context while Fig. 2b shows results for the Medical Context. The ratings range from 0 to 100%. If the median human participant agreed with GPT-4 in rating probability, the plotted points would fall on the dashed blue identity line. The letter codes correspond to the letter codes assigned to each probabilistic phrase in Table 1. An intercept significantly different from 0 or a slope significantly different from 1 would indicate a patterned discrepancy between GPT-4 and the median human participant. We test for both possibilities.
There is an evident outlier in Fig. 2a for the probability phrase "low risk", plotted in red. There is a similar outlier in Fig. 2b for the probabilistic phrase "not certain". The outliers represent probabilistic phrases where GPT-4 and the median human participant assigned markedly different probabilities to the same probabilistic phrase. In the main text we report statistical analyses for this and later figures without these outliers. All results of hypothesis tests, with and without outliers, are included in a Supplement. We discuss outliers further in the section Stability.
We refer to tests with p-values less than 0.05 as "significant" for convenience in presenting the data. We report exact p-values for all tests in the main text, and for all tests both with and without outliers removed in the Supplement.
Summary. There are significant patterned differences between median human probability estimates and those of GPT-4. In both contexts median human estimates of probability are compressed by a factor of 0.8 relative to the estimates by GPT-4. In the Medical Context but not in the Investment Context, human estimates of probabilities are also offset vertically by roughly 10%. Human use of probability and relative frequency is typically distorted 34,35 and the deviations we detect may be connected to probability distortion.
Discordance. Both the human participants and GPT-4 are engaged in a coordination game, and the second criterion for comparing human and machine is trial-by-trial winnings in the game. Did GPT-4 disagree with the human players more than they disagreed with one another?
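The two tests just described can be sketched with an ordinary least-squares fit of median human ratings (y) on GPT-4 ratings (x). This is a minimal sketch under standard simple-regression assumptions, not necessarily the exact procedure used for the reported statistics; the function name and the demonstration numbers are hypothetical, and the p-values would come from a t distribution with the returned degrees of freedom.

```python
import math


def fit_line_with_tests(x, y):
    """OLS fit y = intercept + slope * x.

    Returns the estimates plus t statistics for H0: intercept = 0 and
    H0: slope = 1. Assumes the fit is not perfect (nonzero residual
    variance) and that x is not constant.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    if s2 == 0:
        raise ValueError("perfect fit: t statistics are undefined")
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "intercept": intercept,
        "slope": slope,
        "t_intercept_vs_0": intercept / se_intercept,
        "t_slope_vs_1": (slope - 1.0) / se_slope,
        "df": n - 2,
    }


# Made-up ratings (0-100), NOT data from the study.
gpt_ratings = [5, 20, 40, 60, 80, 95]
human_median_ratings = [12, 22, 41, 55, 70, 84]
print(fit_line_with_tests(gpt_ratings, human_median_ratings))
```

A slope below 1 in such a fit corresponds to human estimates that are compressed relative to GPT-4's, and a nonzero intercept to a vertical offset.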
We define a measure of the disagreement between the probability or ambiguity estimates of the human participants. Let $\mathbf{p}_i = [p_{i1}, \ldots, p_{im}]$ be the vector containing the $m = 23$ probability estimates of the $i$th human participant in the order of Table 1, and let $n$ be the number of human participants in a context. We define the discordance of the $i$th human participant to be

$$D_i = \sum_{j \ne i} \lVert \mathbf{p}_i - \mathbf{p}_j \rVert^2, \tag{1}$$

where $\lVert \mathbf{p}_i - \mathbf{p}_j \rVert = \sqrt{\sum_{k=1}^{m} (p_{ik} - p_{jk})^2}$ denotes the Euclidean distance between $\mathbf{p}_i$ and $\mathbf{p}_j$. The discordance of a participant is just the sum of the squared distances between the vector corresponding to the participant and each of the vectors corresponding to the remaining participants. It can be zero only if all the participants give identical estimates for all probability phrases. Let $\mathbf{p}_{\mathrm{GPT}}$ denote the vector of probability estimates made by GPT-4 and define the discordance of GPT-4 to be

$$D_{\mathrm{GPT}} = \frac{n-1}{n} \sum_{j=1}^{n} \lVert \mathbf{p}_{\mathrm{GPT}} - \mathbf{p}_j \rVert^2. \tag{2}$$

There are $n - 1$ summands in Eq. (1) and $n$ in Eq. (2). The multiplicative term $(n-1)/n$ in Eq. (2) corrects for the difference in the number of summands in the two equations.
Were GPT-4's judgments of probability more discrepant from those of the human participants than those of the human participants were from one another? Figure 3a is a boxplot of all the discordance values for the Investment Context, one blue dot per human participant. The discordance values are plotted vertically, and the red horizontal line marks the median discordance in each context. The lower and upper edges of the box mark the 25th percentile and the 75th percentile, respectively. Figure 3b is the corresponding plot for the Medical Context. The red diamonds mark the discordances of GPT-4 in the two contexts. The discordance of GPT-4 is below the median of the discordances for the humans for both Contexts. GPT-4 agreed with the human participants at least as well as they agreed with one another.
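The discordance comparison can be sketched in a few lines. This sketch assumes that a participant's discordance is the sum of squared Euclidean distances to every other participant, and that GPT-4's sum over all n human participants is rescaled by (n − 1)/n; reading the correction factor this way, with n the number of human participants, is an assumption about the notation. All function names and the tiny data set are hypothetical.

```python
from statistics import median


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length rating vectors."""
    if len(p) != len(q):
        raise ValueError("rating vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q))


def human_discordances(estimates):
    """Per participant: sum of squared distances to every other participant."""
    return [
        sum(squared_distance(p_i, p_j)
            for j, p_j in enumerate(estimates) if j != i)
        for i, p_i in enumerate(estimates)
    ]


def gpt_discordance(gpt_estimates, estimates):
    """Sum of squared distances from GPT-4 to all n humans, scaled by (n - 1) / n."""
    n = len(estimates)
    if n < 2:
        raise ValueError("need at least two human participants")
    total = sum(squared_distance(gpt_estimates, p_j) for p_j in estimates)
    return (n - 1) / n * total


# Made-up ratings for three participants and two phrases, NOT study data.
humans = [[60, 20], [70, 25], [55, 10]]
gpt = [65, 20]
d_humans = human_discordances(humans)
d_gpt = gpt_discordance(gpt, humans)
print(d_humans, d_gpt, d_gpt <= median(d_humans))
```

Comparing GPT-4's value with the median of the human values mirrors the comparison shown in the boxplots.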

Human vs. GPT-4: ambiguity
In Fig. 4 we plot the median ambiguity ratings of the 25 human participants against the GPT-4 ratings of each of the 23 probabilistic phrases. Figure 4A shows results in the Investment Context while Fig. 4B shows results for the Medical Context. The letter codes once again correspond to the letter codes assigned to each probabilistic phrase in Table 1.
Intercept The Intercept estimate in Fig. 4A
Summary. There are significant patterned differences between median human ambiguity estimates and those of GPT-4. In both contexts median human estimates are compressed by a factor of 0.5 to 0.7 relative to the estimates by GPT-4.
We did not analyze discordance for ambiguity since there are evident large differences in estimation of ambiguity by GPT-4 and the median human participant.

Comparisons across context
We next evaluate the extent to which human judgments of probability and ambiguity are invariant across context. In Fig. 5A we plot the median probability ratings of the human participants in the Investment Context versus the median probability ratings for each of the 23 probabilistic phrases of a different group of human participants in the Medical Context. A similar plot for GPT-4 is shown in Fig. 5B. Figure 6A,B show corresponding plots for ambiguity.
Summary. There are significant patterned differences between ambiguity estimates by GPT-4 in the Investment Context and the Medical Context.

Stability
The two outliers in GPT-4's performance raise issues concerning the stability of GPT-4. We chose to examine the outlier in Fig. 2B (the probabilistic phrase not certain) to determine whether it reliably recurs (representing a large but reliable discrepancy between human and GPT-4 estimates) or whether it is evidence of instability. If it reliably recurs, then it is effectively a difference of opinion between human and machine as to the meaning of a particular probabilistic phrase. If not, it would suggest that GPT-4 is unstable.
GPT-4 included explanations for its responses. We tabulate these explanations for the outlier in Fig. 2A (low risk) in the Investment Context and the outlier in Fig. 2B (not certain) in the Medical Context. The reader may agree with GPT-4 or not, but GPT-4's response acknowledges that the probabilistic phrases can be interpreted in more than one way, and perhaps human and machine are simply in disagreement.
Will the outlier recur if we rerun the trial? The GPT-4 interface that we have access to limits the number of runs that we can carry out in a fixed period of time, precluding analyses that require large numbers of repetitions of trials. We redid the GPT-4 estimates in the Medical Context four times, plotting the estimates as four blue contours in Fig. 7. The original estimates are plotted in red with a red solid circle marking the outlier. In brief, we did not reproduce the anomalous outlier we initially encountered, nor did other outliers emerge for any of the other probabilistic phrases. The four new estimates of probability are in good agreement with those of the human participants and with one another but not with the original GPT-4 estimate.

Discussion
In coordination games we share information to coordinate actions 5,6. The specific coordination game we consider here concerns correct use of probabilistic phrases signaling probability and ambiguity. There were two versions of the game, one with probabilistic phrases used to give investment advice, one with these same phrases used to give medical advice. Half the human participants ran in the Investment Context, half in the Medical.
www.nature.com/scientificreports/

Conclusions: Estimates of Probability
• As measured by discordance, GPT-4 agreed with human participants at least as well as the median human participant agreed with the other participants. Based on overall performance we cannot distinguish GPT-4 and human (Fig. 3).
• Examined in detail (linear model fits), we found significant patterned discrepancies between GPT-4's estimates of probability and those of human participants that could be captured by fits to a linear model (Fig. 2). In both contexts, human estimates of probabilities tended to be compressed relative to those of GPT-4.
• Use of probabilistic phrases by both GPT-4 and humans signaling probability transferred well across contexts for the two contexts we considered (Fig. 5). A doctor's use of likely conveys the same information about probability as that of an investment consultant.

Conclusions: Estimates of Ambiguity
Human estimates of ambiguity were compressed relative to those of GPT-4 by roughly a factor of 2. However, unlike probability, there is no standard scale of ambiguity.We can only conclude that GPT-4 did not anticipate human use of the ambiguity scale, a failure to coordinate.
There is some indication that GPT-4 is unstable, producing occasional outliers.Further research is needed to evaluate this apparent instability (Table 2).
We focused on one coordination game and compared human and machine. Similar games could be based on color terms or sets of dimensional adjectives 36, for example, the dimensional adjectives describing size: small, big, large, etc. 36,37. Gurnee & Tegmark 20 look at representation of space and time.
But human use of probability phrases is a particularly rich source of possible coordination games that we could use to compare human and machine.We can challenge GPT-4 to play each game we develop, comparing human and machine as we did here.
When, for example, do humans use probability phrases and when do they use numerical probability? Dhami & Mandel 38, in their review article, argue that the choice between the use of numerical or verbal probabilities by senders is influenced by several factors. For example, Juanchich and Sirota 39 find that, in the medical context, senders prefer to use numerical values when uncertainty is about very consequential events, such as the serious side effects of a drug. Would GPT-4 have similar preferences?
Wallsten et al. 40 found that most people preferred to receive information about the probability of a chance event in numerical form but preferred to transmit this information as a probabilistic phrase. Erev & Cohen 41 referred to this pattern of preference as the Communication Mode Preference Paradox. Whatever justification we offer for transmitting a probabilistic phrase instead of a numerical probability would seem to apply to receiving it in the same form, an apparent paradox. Would GPT-4 exhibit the same paradox?
Senders' use of verbal probabilities has several effects other than conveying an estimate of uncertainty 38. Honda 42 found that the use of verbal probabilities, for example using positive rather than negative terms, can introduce bias into the decision making process. Verbal probabilities can also be used as a face-saving strategy by receivers and senders 43-45. Does GPT-4 exhibit similar biases?
Comparing GPT-4 to human in these coordination games provides a systematic way to assess what GPT-4 (or any other LLM) can and cannot do, and where its performance matches, exceeds, or falls short of human performance. There is more to language than a series of coordination games, but such games provide a scaffolding allowing us to describe what GPT-4 does in a principled way.

Table 2. GPT-4's Justifications of Estimated Probabilities. GPT-4's probability estimates for the two different contexts were discrepant for the probabilistic phrases "low risk" and "not certain". GPT-4's output includes an explanation of each choice, which we include here.

Phrase: "Low risk"
Investment Context. Estimate: 85%. Justification: When you describe something as "low risk" in the context of investments increasing in value, you're conveying that there's a high likelihood the investments will perform positively or, at the very least, not decline significantly. Given the options provided, and interpreting "low risk" as a high probability of the investments increasing in value, I would estimate the communicated probability to be: 85%. This reflects that there's a high likelihood of positive performance, but as with all investments, there's never a guarantee. Always remember that while terms like "low risk" are commonly understood in finance, their interpretation can still vary somewhat based on individual perception and context.
Medical Context. Estimate: 10%. Justification: If I use the phrase "low risk" to describe the probability of a headache occurring as a side effect, I'm suggesting that it's not a common outcome, but there's still some possibility. Given the list of percentages you provided, "10%" seems to be a good representation of what "low risk" might imply in this context.

Phrase: "Not certain"
Investment Context. Estimate: 50%. Justification: The phrase "not certain" is somewhat ambiguous. While it indicates that there isn't a guarantee, it doesn't provide a clear indication of how likely or unlikely the event is. Given its vagueness, but still leaning toward the idea that there's less confidence than "certain," a neutral probability might be represented. Therefore, I would estimate the communicated probability for "not certain" to be: 50%. However, it's essential to understand that this phrase leaves a lot to interpretation, and different people might infer different probabilities from it.
Medical Context. Estimate: 95%. Justification: The phrase "not certain" is ambiguous and indicates that there's uncertainty, but it doesn't necessarily provide a clear indication of how probable or improbable the event is. It could be slightly probable or highly probable, but just not 100% certain. Given the options provided and the context, "95%" might be the best choice to convey that there's a high likelihood, but it's not certain. However, it's important to note that the phrase "not certain" doesn't inherently align with any specific percentage; the chosen value is more of an educated guess based on the given options.

14:6835 | https://doi.org/10.1038/s41598-024-56740-9

Intercept. The Intercept estimate in Fig. 2a is 1.59, not significantly different from 0 [t(22) = 0.421, p = 0.339]. The Intercept estimate in Fig. 2b is 11.35, significantly different from 0 [t(22) = 3.458, p = 0.0011].
Slope. The Slope estimate in Fig. 2a is 0.833, significantly different from 1 [t(22) = −2.577, p = 0.009]. The Slope estimate in Fig. 2b is 0.825, also significantly different from 1 [t(22) = −2.933, p = 0.0038].

Figure 2. Human versus GPT-4. (a) Median human estimates of probability in the Investment Context are plotted versus GPT-4 estimates of probability in the same context (blue filled circles). A letter code adjacent to the blue filled circle identifies the probabilistic phrase associated with each circle. See Table 1. One outlier (D low risk) is marked with its probabilistic phrase in red. See text. (b) Median human estimates of probability in the Medical Context are plotted versus GPT-4 estimates of probability. One outlier (N not certain) is marked with its probabilistic phrase in red. See text.

Figure 3. Discordance. We computed discordance, a measure of disagreement between each human observer and the remaining human observers, and between GPT-4 and the human observers. See text. The left and right panels are boxplots of discordance values for the Investment Context and for the Medical Context, respectively. The top and bottom of the boxes mark the 75th and 25th percentiles for each context. The discordance for GPT-4 is marked by a solid red diamond in each context. The discordance for GPT-4 is below the median discordance (the solid red line segment) for the human participants in both contexts.

Figure 4. Human versus GPT-4. (a) Median human estimates of ambiguity in the Investment Context are plotted versus GPT-4 estimates of ambiguity in the same context as blue filled circles. The format is analogous to that of Fig. 2a. (b) Median human estimates of ambiguity in the Medical Context are plotted versus GPT-4 estimates of ambiguity in the same context. The format is analogous to that of Fig. 2b.

Figure 5. The effect of context. (a) Comparison of probability estimates of the median human observer across contexts. The format is analogous to that of Fig. 2a. All data fall roughly along the identity line. Human participants select lower probabilities for the same probabilistic phrase in the Medical Context. (b) Comparison of GPT-4 probability estimates across contexts. The format is analogous to that of Fig. 2b. All data fall roughly along the identity line. The same two outliers appear in Fig. 4b as in Fig. 2a,b.

Figure 6. The effect of context. (a) The median of human estimates of ambiguity in the Investment Context is plotted versus the median of human estimates of ambiguity in the Medical Context. The format is analogous to that of Fig. 5a. Human estimates are significantly higher in the Medical Context. See text. (b) The median of GPT-4 estimates of ambiguity in the Investment Context is plotted versus the median of GPT-4 estimates of ambiguity in the Medical Context. GPT-4 estimates are significantly higher in the Medical Context. See text.

Figure 7. Stability of GPT-4 estimates in the Medical Condition. Figure 2b (GPT-4 estimates of probability in the Medical Context) contains an evident outlier, not certain, whose estimate is 50% in the Investment Condition and 95% in the Medical Context. We plot the GPT-4 estimates from the Medical Context in red with the outlier marked by a red disk. We repeat these GPT-4 estimates four times and plot them as well, in blue. The contours are slightly offset vertically so that they are visible. The outlier does not recur, and the four repetitions are in good agreement with each other and with the original estimates except for the outlier. See text.

Table 1. Probabilistic phrases taken from 1. Each phrase has a code letter used in plotting results. Each phrase was embedded in a Medical Context and an Investment Context. Half the participants, chosen at random, completed the Medical Context; the remaining participants completed the Investment Context.